WIP expand Variables class to handle s3 urls from NSIDC #434

rwegener2 · 2023-08-01T14:34:55Z

What was done

At the moment, the Variables class .avail() method fails if given an s3 url. This PR expands the class to handle this case.

How it was done

In addition to file and order, a new vartype, nsidc-s3, was added to the class. nsidc was specified (instead of just s3) because extracting the product and version from the filepath relies on the nsidc naming structure. So, you couldn't just give Variables the s3 path to any IS2 file anywhere on AWS; it needs to be from the nsidc-cumulus-prod-protected bucket. If others have feedback on this decision, I'm happy to hear it. We could alternatively try to extract the product and version from the metadata, which would remove this limitation.
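To make the bucket restriction concrete, here is a minimal sketch of how a path could be classified before any parsing happens. The helper name guess_vartype is hypothetical and not part of icepyx; only the bucket name and the vartype strings come from the description above.

```python
from urllib.parse import urlsplit

# Bucket named in this PR; any other s3 bucket is rejected in this sketch.
NSIDC_BUCKET = "nsidc-cumulus-prod-protected"


def guess_vartype(path):
    """Hypothetical helper: classify a path as 'nsidc-s3' or 'file'."""
    parts = urlsplit(path)
    if parts.scheme == "s3":
        # For s3 urls the bucket name lands in the netloc component.
        if parts.netloc != NSIDC_BUCKET:
            raise ValueError(
                f"Only s3 paths in the {NSIDC_BUCKET} bucket are supported, "
                f"got bucket {parts.netloc!r}"
            )
        return "nsidc-s3"
    # Anything that is not an s3 url is treated as a local file path here.
    return "file"
```

Rejecting other buckets up front keeps the failure close to the cause, instead of surfacing later as a template mismatch.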

Within the .avail() method, the nsidc-s3 filetype uses the same variable extraction strategy as order. One interesting note: grabbing variables with the order method is very quick, taking only the time needed to ping an API endpoint and parse the response. The .avail() function gets very slow with cloud data when trying to walk the full file (as happens when using the file argument), but we can avoid that by using the order method of extracting variables instead.

Todo / Next steps

This PR is pretty close, but blocked by determining how to pass authentication (#433). After auth is addressed this PR can be merged into the getvars branch, where work has already started changing the way the Variables class is called in Read and Query.

add test case
update after authentication PR
add parse as a library requirement

github-actions · 2023-08-01T14:35:12Z

👈 Launch a binder notebook on this branch for commit 703d832

I will automatically update this comment whenever this PR is modified

👈 Launch a binder notebook on this branch for commit 61b2006

👈 Launch a binder notebook on this branch for commit ba52c55

weiji14 · 2023-08-08T19:59:04Z

icepyx/core/variables.py

@@ -72,6 +74,14 @@ def __init__(
        elif self._vartype == "file":
            # DevGoal: check that the list or string are valid dir/files
            self.path = path
+        elif self._vartype == "nsidc-s3":
+            # Grab metadata from s3 path
+            template = ('s3://nsidc-cumulus-prod-protected/ATLAS/{product}/{version}/' 


This template looks to be hardcoded to the non-gridded ATLAS datasets like ATL06. Gridded products (e.g. ATL14, ATL16, ATL20) would have a different template like s3://nsidc-cumulus-prod-protected/ATLAS/{product}/{version}/{year}/{filename}. E.g. looking at https://search.earthdata.nasa.gov/search?ff=Available%20in%20Earthdata%20Cloud&fi=ATLAS&gdf=HDF&fst0=Cryosphere&lat=65.08299573518599&long=-25.69921875&zoom=5:

Dataset	Sample path
ATL06	s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2023/04/16/ATL06_20230416235213_04061911_006_02.h5
ATL07	s3://nsidc-cumulus-prod-protected/ATLAS/ATL07/005/2022/10/12/ATL07-02_20221012220720_03391701_005_01.h5
ATL10	s3://nsidc-cumulus-prod-protected/ATLAS/ATL10/005/2022/10/12/ATL10-01_20221012220720_03391701_005_01.h5
ATL11	s3://nsidc-cumulus-prod-protected/ATLAS/ATL11/005/2022/03/27/ATL11_006305_0315_005_03.h5
ATL14	s3://nsidc-cumulus-prod-protected/ATLAS/ATL14/002/2019/ATL14_IS_0314_100m_002_01.nc
ATL16	s3://nsidc-cumulus-prod-protected/ATLAS/ATL16/004/2022/ATL16_20220722003637_04601601_004_01.h5
ATL20	s3://nsidc-cumulus-prod-protected/ATLAS/ATL20/003/2022/ATL20-01_20220901002201_10861601_003_01.h5

Would it be possible to generalize this code to both non-gridded and gridded products?
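One way this could be generalized, sketched with the standard library re module rather than the template approach in the diff: treat the month/day directories as optional so that both layouts in the sample paths match. The pattern and the function name parse_nsidc_s3_path are illustrative assumptions based only on the sample paths listed here, not the code in this PR.

```python
import re

# Assumed layout, inferred from the sample paths in this review:
#   non-gridded: .../ATLAS/{product}/{version}/{year}/{month}/{day}/{filename}
#   gridded:     .../ATLAS/{product}/{version}/{year}/{filename}
S3_PATTERN = re.compile(
    r"^s3://nsidc-cumulus-prod-protected/ATLAS/"
    r"(?P<product>ATL\d{2})/(?P<version>\d{3})/"
    r"\d{4}/(?:\d{2}/\d{2}/)?"  # year, then optional month/day directories
    r"(?P<filename>[^/]+)$"
)


def parse_nsidc_s3_path(path):
    """Hypothetical helper: pull product, version and filename from a path."""
    match = S3_PATTERN.match(path)
    if match is None:
        raise ValueError(f"Not a recognized NSIDC ATLAS s3 path: {path}")
    return match.groupdict()


# Example with one non-gridded and one gridded sample path from above.
for sample in (
    "s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2023/04/16/ATL06_20230416235213_04061911_006_02.h5",
    "s3://nsidc-cumulus-prod-protected/ATLAS/ATL14/002/2019/ATL14_IS_0314_100m_002_01.nc",
):
    info = parse_nsidc_s3_path(sample)
    print(info["product"], info["version"])
```

This still depends on the NSIDC directory layout, so it shares the limitation of any path-based approach: a renamed or relocated file will not match.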

Ah, thanks for catching this @weiji14! I'll work on a fix and let you know when I'm ready for another review!

A question for @JessicaS11 and @weiji14 -- What do we think of getting the version and product name from inside the file instead of parsing them from the filename? I've only checked a handful of products, but those fields seem to be available in top-level metadata in a consistent way. I've been trying to parse those things out of the filename, which is how I believe it is also done elsewhere in the module, but this limits the files icepyx can process to those named in a very specific way. If we grab product/version from inside the file, we are able to process more files (e.g. cloud ICESat-2 files not in the nsidc bucket, or local files that have had their name changed). Thoughts?
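A rough sketch of what the metadata-based lookup could look like. It works on any mapping of top-level attributes, so it does not depend on how the file was opened. The attribute names short_name and version are assumptions for illustration and would need to be checked against real files; the function name is hypothetical.

```python
def product_and_version(attrs, product_key="short_name", version_key="version"):
    """Hypothetical helper: read product and version from file attributes.

    `attrs` is any mapping of top-level metadata attributes. The default
    key names are assumptions and may differ between products.
    """

    def _as_text(value):
        # String attributes read from HDF5 files may come back as bytes.
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    try:
        return _as_text(attrs[product_key]), _as_text(attrs[version_key])
    except KeyError as err:
        raise KeyError(f"Attribute {err} not found in file metadata") from err


# Example with a plain dict standing in for the attributes of an open file.
print(product_and_version({"short_name": b"ATL06", "version": "006"}))
```

With h5py, the mapping could be the attrs of an opened file, which would make the lookup independent of the filename and bucket.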

I realized that the place I was last thinking about this was in the branch to remove intake from icepyx. I just pushed a WIP PR (#438) so there is a place to discuss questions. Hopefully whatever we decide there about accessing the product/version can be used for this PR later.

The discussion summarized in #438 (comment) indicates our intention to move away from requiring the user to provide the product as input (unless they are also feeding in a directory containing files from multiple products). This should address the template issues noted here.


rwegener2 · 2023-10-23T21:22:40Z

This PR prompted two prior PRs: #444 and #451. I think those updates are exciting for enabling the goal of this PR, which is allowing the Variables class to list s3 data variables. The changes in those two PRs have made the approach to this goal quite different than when this PR was opened (it was more complicated back then!). As a result, I'm going to close this PR and open a new one to pursue adding s3 url reads to the Variables class.

rwegener2 added 2 commits July 31, 2023 18:53

wip inheritance method for modularizing authentication

57ac538

add nsidc_s3 option to Variables class

703d832

rwegener2 added 3 commits August 1, 2023 17:07

mvp remove intake from Read

9d09ff9

outline of mixin method of authentication

4564b3b

add s3 credential timer and auth check

16b8d3f

weiji14 reviewed Aug 8, 2023


rwegener2 and others added 23 commits August 9, 2023 11:12

Update icepyx/core/variables.py

54aeda0

Co-authored-by: Jessica Scheick

Update icepyx/core/query.py

083426c

Co-authored-by: Jessica Scheick

add docstrings to auth.py

76d3c96

Merge branch 'move_auth' of https://github.com/icesat2py/icepyx into move_auth

edfa362

add comment to stop tests from running docstring on build

e8c9060

fix user warning for giving an email parameter

72c5347

add tests for auth module

434bbf2

add warning message for use of earthdata_login

7e8bf0f

remove .netrc creation and update existing tests to new auth method

e06765a

undo changes to troubleshoot build

9e1f745

another baby commit to figure out what is breaking travis

a2a455f

remove duplicate netrc creation

1aacb1d

update documentation for new auth procedure

50db05e

remove earthdata_login function from docstrings

f44ead3

remove missed instance of earthdata_login in docs

12e21ba

attempt add auth to API reference

10bc734

Update icepyx/core/auth.py

8033ed4

Co-authored-by: Jessica Scheick

alphabetize ordering

f1fa0df

add warning to dev log

6cdddbf

add auth to components docs

b7b8b7e

add more detail to auth string

97fda07

add authentication explainer

1172e9e

add internals to index.rst

abd950a

rwegener2 and others added 26 commits August 30, 2023 20:43

update documentation for removing intake

de61d87

update approach paragraph

9f06611

remove one more instance of catalog from the docs

d019b9a

clear jupyter history

156ea89

Update icepyx/core/read.py

b26ca4e

Co-authored-by: Wei Ji

remove intake and related modules

ce1ca76

Merge branch 'development' into read_arguments

fd00aeb

mvp with new read parameters

431af78

clean up remainder of file and remove extraneous comments

612662e

maintain backward compatibility and combine arguments

c16a003

update to new error message

7648078

update docs

4cfbfdb

glob kwargs and list error

f7f823b

formatting updates

203f3ad

Apply suggestions from code review

10d1591

Co-authored-by: Jessica Scheick

remove num_files

0b23d1e

fix docs test typo

6f5bead

trying again to fix the build

035ee5a

add feedback to docs page

903c351

Merge branch 'development' into read_arguments

d842bde

fix typo

5e06de9

Merge branch 'development' into read_arguments

9ca29f1

Merge branch 'development' into read_arguments

e8e35ad

Merge branch 'development' into s3_variables

d26a194

Merge branch 'read_arguments' into s3_variables

af79818

resolve merge conflicts from development

ba52c55

rwegener2 closed this Oct 23, 2023

rwegener2 deleted the s3_variables branch October 23, 2023 21:23

JessicaS11 linked an issue Jan 24, 2024 that may be closed by this pull request

Make Variables an independent class #450

Open

2 tasks
